Segmenting Chinese in Unicode

Author

  • Thomas EMERSON
Abstract

The automatic segmentation of Chinese text is an ongoing problem in information retrieval (IR) and computational linguistics: “words” in written Chinese are not delimited by spaces, so tokenizing (the first phase of many IR tasks) is considerably more difficult than for Western languages. This paper presents an overview of the segmentation problem, details previous research into its solution, and introduces Basis Technology’s Chinese Morphological Analyzer (CMA), a new, general-purpose hybrid segmentation system. The CMA is Unicode-based and can handle both Simplified and Traditional Chinese text from a variety of locales, including Mainland China, Taiwan, Hong Kong, and Singapore.
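
To make the difficulty concrete, the sketch below contrasts whitespace tokenization with forward maximum matching, a textbook dictionary-based baseline. It is not the CMA's method, which the abstract describes only as a hybrid system. The function name `forward_max_match`, the `max_len` parameter, and the four-entry lexicon are all hypothetical and chosen purely for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation over a toy lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character
        # when no lexicon entry matches at this position.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon (assumption): real systems use far larger ones.
LEXICON = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"  # "research the origin of life"

# Whitespace splitting finds no boundaries at all.
print(sentence.split())

# Greedy matching takes 研究生 ("graduate student") first and strands 命,
# although 研究 / 生命 / 起源 is the intended reading.
print(forward_max_match(sentence, LEXICON))
```

Because Python strings are sequences of Unicode code points, the same function works unchanged on Simplified or Traditional input. The mis-segmentation it produces on this sentence is the kind of ambiguity that motivates approaches going beyond pure dictionary lookup.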

Similar resources

Supporting Chinese Character Variants in Hong Kong through Ideographic Variation Sequence

This paper will introduce an ongoing project in Hong Kong that makes use of the Ideographic Variation Sequence (IVS) and the associated Ideographic Variation Database (IVD) developed by the Unicode Consortium for character glyph registration. Hong Kong uses the traditional Chinese writing system similar to that of Taiwan and thus used the Big5 encoding for many years. But, Chinese characters us...

Building a Collation Element Table for a Large Chinese Character Set in YES

YES is a simplified stroke-based method for sorting Chinese characters. It is free from stroke counting and grouping, and thus much faster and more accurate than the traditional method. This paper presents a collation element table built in YES for a large joint Chinese character set covering (a) all 20,902 characters of Unicode CJK Unified Ideographs, (b) all 11,408 characters in the Complete ...

ALT-J/C - a prototype Japanese-to-Chinese automatic language translation system

This paper describes a prototype Japanese-to-Chinese automatic language translation system. ALT-J/C (Automatic Language Translator Japanese-to-Chinese) is a semantic transfer based system, which is based on ALT-J/E (a Japanese-to-English system), but written to cope with Unicode. It is also designed to cope with constructions specific to Chinese. This system has the potential to become a framew...

Unicode Chinese paleography: making the evolutionary leap from bone, bronze, silk, and paper, to electronic bits Dr. Richard

As more and more rare characters are encoded, Unicode provides better and better support for Chinese. In conjunction with CDL technology, texts of many historical periods can now be digitized with unparalleled accuracy. For even greater accuracy, a move beyond the encoding of modern-style CJK characters is required, and specialists from all over the world have begun to express interest in worki...

A Unicode Based Adaptive Segmentor

This paper presents a Unicode-based Chinese word segmentor. It can handle Chinese text in Simplified, Traditional, or mixed mode. The system uses the strategy of divide-and-conquer to handle the recognition of personal names, numbers, time and numerical values, etc., in the preprocessing stage. The segmentor further uses tagging information to work on disambiguation. Adopting a modular design app...

Publication date: 2000